Big Data Engineer Resume
Dallas, TX
SUMMARY
- Around 7+ years of IT experience in software development and support, with experience in developing strategic methods for deploying Big Data technologies to efficiently solve Big Data processing requirements.
- Expertise in Hadoop ecosystem components HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Spark, Spark SQL, Spring Boot, Spark Streaming, and Hive for scalability, distributed computing, and high-performance computing.
- Experience in using Hive Query Language for data Analytics.
- Experienced in Installing, Maintaining and Configuring Hadoop Cluster.
- Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks Data Platform 2.1 & 2.2, and CDH3/CDH4 with Cloudera Manager on Linux and Ubuntu OS.
- Capable of processing large sets of structured, semi-structured, and unstructured data and supporting systems application architecture.
- Good knowledge of single-node and multi-node cluster configurations.
- Strong knowledge of NoSQL column-oriented databases like HBase, Cassandra, MongoDB, and MarkLogic, and their integration with the Hadoop cluster.
- Expertise in the Scala programming language and Spark Core.
- Worked with AWS-based data ingestion and transformations.
- Worked with Cloudbreak and blueprints to configure the AWS platform.
- Worked with data warehouse tools like Informatica and Talend.
- Experienced in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Good knowledge of Amazon EMR, Amazon RDS, S3 buckets, DynamoDB, and Redshift.
- Analyzed data, interpreted results, and conveyed findings in a concise and professional manner.
- Partnered with the Data Infrastructure team and business owners to implement new data sources and ensure consistent definitions are used in reporting and analytics.
- Promoted a full-cycle approach including request analysis, creating/pulling datasets, report creation and implementation, and providing the final analysis to the requestor.
- Good experience with Kafka and Storm.
- Worked with Docker to establish a connection between Spark and a Neo4j database.
- Knowledge of java virtual machines (JVM) and multithreaded processing.
- Hands-on experience working with ANSI SQL.
- Strong programming skills in designing and implementing applications using Core Java, J2EE, JDBC, JSP, HTML, Spring Framework, Spring Batch, Spring AOP, Spring Boot, Struts, JavaScript, and Servlets.
- Experience in writing build scripts using Maven and working with continuous integration systems like Jenkins.
- Java developer with extensive experience with various Java libraries, APIs, and frameworks.
- Hands-on development experience with RDBMS, including writing complex SQL queries, stored procedures, and triggers.
- Very good understanding of SQL, ETL, and data warehousing technologies.
- Knowledge of MS SQL Server 2012/2008/2005 and Oracle 11g/10g/9i and E-Business Suite.
- Expert in T-SQL, creating and using stored procedures, views, and user-defined functions, and implementing Business Intelligence solutions using SQL Server 2000/2005/2008.
- Developed Web-Services module for integration using SOAP and REST.
- NoSQL database experience on HBase, Cassandra, and DynamoDB.
- Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14, and Cosmos.
- Sound knowledge of designing data warehousing applications using tools like Teradata, Oracle, and SQL Server.
- Experience working with Solr for text search.
- Experience using the Talend ETL tool.
- Experience working with job schedulers like Autosys and Maestro.
- Strong in databases like Sybase, DB2, Oracle, MS SQL, and Clickstream.
- Strong understanding of Agile Scrum and Waterfall SDLC methodologies.
- Strong working experience in Snowflake.
- Hands-on experience with automation tools such as Puppet, Jenkins, Chef, Ganglia, and Nagios.
- Strong communication, collaboration & team-building skills, with proficiency at grasping new technical concepts quickly and utilizing them in a productive manner.
- Adept in analyzing information system needs, evaluating end-user requirements, custom designing solutions, and troubleshooting information systems.
- Strong analytical and problem-solving skills.
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Spark, Zookeeper, Cloudera Manager, Splunk
NO SQL Database: HBase, Cassandra
Monitoring and Reporting: Tableau, Custom shell scripts
Hadoop Distribution: Horton Works, Cloudera, MapR
Build Tools: Maven, SQL Developer
Programming & Scripting: JAVA, C, SQL, Shell Scripting, Python, Scala
Java Technologies: Servlets, JavaBeans, JDBC, Spring, Hibernate, SOAP/Rest services
Databases: Oracle, MySQL, MS SQL Server, Teradata
Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript, AngularJS
Version Control: SVN, CVS, GIT
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003
PROFESSIONAL EXPERIENCE
Confidential -Dallas, TX
Big Data Engineer
Responsibilities:
- Developed a data pipeline with Kafka and Spark.
- Contributed to designing the data pipeline with the Lambda Architecture.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Involved in installation, configuration, supporting and managing Hadoop clusters.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used Spark for interactive queries and processing of streaming data.
- Extensively worked with partitions, dynamic partitioning, and bucketing of tables in Hive; designed both managed and external tables, and worked on optimization of Hive queries.
- Developed Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of the existing algorithms in Hadoop.
- Used Spark Context, Spark SQL, DataFrames, and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly.
- Configured a data model to get data from Kafka in near real time and persist it to Cassandra.
- Developed a Kafka consumer API in Python for consuming data from Kafka topics.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
- Migrated an existing on-premises application to AWS.
- Used AWS services like EC2 and S3 for small data sets processing and storage.
- Experienced in Maintaining the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDD, Performed transformations and actions on RDDs.
Environment: Hortonworks Data Platform, Apache Hadoop, Hive, Python, Hue, Zookeeper, MapReduce, Sqoop, Crunch API, Pig 0.10 and 0.11, HCatalog, Unix, Java, JSP, Eclipse, Maven, Oracle, SQL Server, Linux, MySQL.
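The Kafka consumer work described in this role can be sketched roughly as below. This is an illustrative, minimal consumer in Python, not the project's actual code: the topic name `events`, the broker address, and the assumption that messages carry UTF-8 JSON are all hypothetical.

```python
import json

def parse_event(raw: bytes) -> dict:
    """Decode one Kafka message value (assumed UTF-8 JSON) into a dict,
    tagging undecodable records instead of crashing the consumer."""
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {"_corrupt": True, "_raw": raw.hex()}

def consume_forever(topic: str = "events", servers: str = "localhost:9092"):
    """Poll a topic and print parsed events. Requires the kafka-python
    client and a reachable broker; kept separate so parse_event stays
    independently testable without any broker."""
    from kafka import KafkaConsumer  # assumed client library
    for msg in KafkaConsumer(topic, bootstrap_servers=servers):
        print(parse_event(msg.value))
```

Keeping the decode step in a pure function is the design point here: the parsing logic can be unit-tested without standing up a Kafka cluster.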
Confidential -San Francisco, CA
Big Data Engineer
Responsibilities:
- Communicated deliverables status to stakeholders and facilitated periodic review meetings.
- Developed a Spark Streaming application to pull data from the cloud to Hive and HBase.
- Built real-time streaming data pipelines with Kafka, Spark Streaming, and Hive.
- Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
- Handled schema changes in data stream using Kafka.
- Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.
- Coordinated Kafka operation and monitoring with DevOps personnel; balanced the impact of Kafka producer and consumer message (topic) consumption.
- Designed and developed ETL workflows using Python and Scala for processing data in HDFS.
- Collected, aggregated, and shuffled data from servers to HDFS using Apache Spark & Spark Streaming.
- Worked on importing and exporting claims information between HDFS and RDBMS.
- Created Hive external tables, loaded data into them, and queried the data using HQL.
- Worked on streaming the prepared data to HBase using Spark.
- Performed performance tuning for Spark Streaming, e.g., setting the right batch interval time, the correct number of executors, and appropriate publishing and memory choices.
- Used HBase connector for Spark.
- Performed gradual cleansing and modeling of datasets.
- Utilized avro-tools to build the Avro schema used to create external Hive tables via PySpark.
- Created and managed external tables to store ORC and Parquet files using HQL.
- Developed Apache Airflow DAGs to automate the pipeline.
- Created a NoSQL HBase database to store the processed data from Apache Spark.
Environment: Snowflake Web UI, Snow SQL, Hadoop MapR 5.2, Hive, Hue, Azure, Control-M, AWS, Teradata Studio, Oracle 12c, Tableau, Hadoop Yarn, Spark Core, Spark Streaming, Spark SQL, Spark MLlib
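The Avro schema work in this role (building record schemas for the external Hive tables) follows the standard Avro JSON schema layout, which can be sketched as below. The record name `claim_event` and its fields are illustrative assumptions, not the project's actual schema.

```python
import json

def build_avro_schema(name: str, fields: dict) -> str:
    """Render an Avro record schema as the JSON text that avro-tools and
    the Hive Avro SerDe expect. Values in `fields` are Avro type specs;
    a list like ["null", "string"] declares a nullable union field."""
    return json.dumps({
        "type": "record",
        "name": name,
        "fields": [{"name": n, "type": t} for n, t in fields.items()],
    }, indent=2)

# Hypothetical schema for illustration only.
schema = build_avro_schema(
    "claim_event",
    {"event_id": "string", "ts": "long", "payload": ["null", "string"]},
)
```

Generating the schema from a dict keeps field definitions in one place, so the same source of truth can drive both the Avro writer and the `TBLPROPERTIES` of the matching Hive external table.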
Confidential -San Diego, CA
Data Engineer/Hadoop Spark Developer
Responsibilities:
- Extensively worked with the Spark SQL context to create DataFrames and Datasets to preprocess the model data.
- Data analysis: expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
- Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
- Involved in designing the row key in HBase to store text and JSON as key-values in an HBase table, and designed the row key in such a way that it can be retrieved/scanned in sorted order.
- Wrote JUnit tests and integration test cases for those microservices.
- Worked in Azure environment for development and deployment of Custom Hadoop Applications.
- Worked heavily with Python, C++, Spark, SQL, Airflow, and Looker.
- Experienced in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Responsible for data extraction and data ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
- Built pipelines to move hashed and un-hashed data from XML files to Data lake.
- Developed a NiFi workflow to pick up multiple files from an FTP location and move them to HDFS on a daily basis.
- Wrote templates for Azure infrastructure as code using Terraform to build staging and production environments. Integrated Azure Log Analytics with Azure VMs for monitoring the log files, storing them, and tracking metrics, and used Terraform to manage different infrastructure resources: cloud, VMware, and Docker containers.
- Scripting: expertise in Hive, Pig, Impala, shell scripting, Perl scripting, and Python.
- Worked with developer teams on a NiFi workflow to pick up data from a REST API server, from the data lake, and from an SFTP server, and send it to Kafka.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Proven experience with ETL frameworks such as Airflow and Luigi.
- Created Hive schemas using performance techniques like partitioning and bucketing.
- Used Hadoop YARN to perform analytics on data in Hive.
- Developed and maintained batch data flows using HiveQL and Unix scripting.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Built large-scale data processing systems in data warehousing solutions, and worked with unstructured data mining on NoSQL.
- S3 data lake management: responsible for maintaining and handling data inbound and outbound requests through the big data platform.
- Specified the cluster size and allocated resource pools, distributing Hadoop by writing the specification texts in JSON file format.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Developed customized Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.
Environment: Hadoop, Microservices, Java, MapReduce, Agile, HBase, JSON, Spark, Kafka, JDBC,AWS, EMR/EC2/S3,Hive, JSON, Pig, Flume, Zookeeper, Impala, Sqoop
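The HBase row-key design mentioned in this role (keys laid out so get/scan returns rows in a useful sorted order) can be illustrated with a minimal sketch. The composite entity-id-plus-reversed-timestamp layout below is a common pattern assumed for illustration, not the actual key used on the project.

```python
# Largest 13-digit millisecond epoch; subtracting from it reverses sort
# order while keeping the padded field fixed-width.
MAX_TS = 10**13 - 1

def make_row_key(entity_id: str, ts_millis: int) -> bytes:
    """Composite HBase row key: entity id plus a zero-padded reversed
    timestamp, so within one entity the newest rows sort (and scan) first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{entity_id}|{reversed_ts:013d}".encode("utf-8")
```

Because HBase stores rows in byte-lexicographic order, the zero padding is essential: without it, `9` would sort after `10` and the scan order would break.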
Confidential
Hadoop Developer
Responsibilities:
- Used Sqoop to efficiently transfer data between relational databases and HDFS, and used Flume to stream log data from servers.
- Enforced partitioning and bucketing in Hive for better organization of the data.
- Worked with different file formats and compression techniques to standards.
- Loaded data from a Unix file system to HDFS.
- Used Unix shell scripts to automate the build process and to perform regular jobs like file transfers between different hosts.
- Assisted in production support, which involved monitoring server and error logs, anticipating and preventing potential problems, and escalating issues when necessary.
- Documented technical specs, dataflows, data models, and class models using Confluence.
- Documented requirements gathered from stakeholders.
- Successfully loaded files to HDFS from Teradata and loaded them from HDFS to Hive.
- Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.
- Involved in researching various available technologies, industry trends, and cutting-edge applications. Data ingestion was done using Flume with Kafka as the source and HDFS as the sink.
- Performed storage capacity management, performance tuning, and benchmarking of clusters.
Environment: Hadoop, Zookeeper, Kafka, UNIX
Confidential
Data Engineer
Responsibilities:
- Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
- Transformed batch data from several tables containing hundreds of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into DataFrames using PySpark.
- Developed a PySpark program that writes DataFrames to HDFS as Avro files.
- Utilized Spark's parallel processing capabilities to ingest data.
- Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
- Developed a script that copies Avro-formatted data from HDFS to the external tables in the raw layer.
- Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- In charge of PySpark code, creating DataFrames from tables in the data service layer and writing them to a Hive data warehouse.
- Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
- Configured the files that allow Airflow to communicate with its PostgreSQL database.
- Developed Airflow DAGs in python by importing the Airflow libraries.
- Utilized Airflow to automatically schedule, trigger, and execute the data ingestion pipeline.
Environment: Cloudera Manager, HDFS, Sqoop, Pig, Hive, Oozie, Spark SQL, Tableau, MySQL, Python, Kafka, Flume, Java, Scala, Git.
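The Airflow scheduling described in this role can be sketched roughly as below. The DAG id `ingest_pipeline`, the daily schedule, and the `run_ingestion` callable are illustrative assumptions, not details taken from the project.

```python
from datetime import datetime

def run_ingestion(**context):
    """Placeholder for the PySpark ingestion step the DAG would trigger."""
    return "ingested"

try:
    # Airflow 2.x imports; the DAG definition is wrapped so the callable
    # above remains usable even where Airflow is not installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="ingest_pipeline",          # illustrative name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = PythonOperator(
            task_id="run_ingestion",
            python_callable=run_ingestion,
        )
except Exception:
    pass  # Airflow missing or incompatible; the callable still works standalone
```

In a real deployment this file would live in the Airflow `dags/` folder, and the scheduler (backed by the PostgreSQL metadata database mentioned above) would pick it up and trigger the daily runs.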