Big Data Engineer Resume

Tampa, FL

SUMMARY

  • Senior IT professional with over 8 years of experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
  • Strong experience working within the SDLC using Agile and Waterfall methodologies.
  • Good exposure to Hadoop distributions such as Cloudera, Hortonworks, and Databricks.
  • Excellent understanding of HDFS, MapReduce, and YARN, and of tools including Hive, Impala, and Pig for data analysis, Sqoop and Flume for data ingestion, and Airflow for workflow scheduling.
  • Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure.
  • Experience in Spark Streaming and Spark Structured Streaming, including Kafka, for real-time data processing.
  • Implemented large-scale technical solutions using object-oriented design and programming concepts in Python.
  • Evaluated Snowflake design considerations for any change in the application.
  • Implemented various algorithms for analytics using Cassandra with Spark and Scala.
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
  • Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
  • Experience in creating Hive tables, loading them with data, and writing Hive queries, which invoke and run MapReduce jobs in the backend (see the sketch at the end of this summary).
  • Hands-on experience with Azure data platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses; built multiple data lakes.
  • Experience in installing, configuring, and administering Hadoop clusters for major Hadoop distributions such as CDH4 and CDH5.
  • Hands-on experience with Kafka and Flume to load log data from multiple sources directly into HDFS.
  • Strong experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
  • Proficient in data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
  • Good understanding of RDBMSs such as Oracle, MySQL, and SQL Server as well as NoSQL databases, with hands-on experience writing applications on NoSQL databases such as HBase, Cassandra, and MongoDB.
  • Excellent SQL programming skills; developed stored procedures, triggers, functions, and packages using SQL and PL/SQL.
  • Used advanced T-SQL features to design and tune interfaces between the database and other applications for efficient operation.
  • Good working knowledge of OLAP, OLTP, business intelligence, and data warehousing concepts, with emphasis on ETL and business reporting needs.
  • Experience developing Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data work.
  • Extensive knowledge of designing reports, scorecards, and dashboards using Power BI.
  • Extensively used Hive queries to query data in Hive tables and loaded data into HBase tables.
  • Flexible working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
  • Developed Oozie workflows to automate ETL processes by scheduling multiple Sqoop, Hive, and Spark jobs.
  • Solid knowledge of Microsoft Office with an emphasis on Excel.
  • Experience developing storytelling dashboards for data analytics, designing reports with visualization solutions using Tableau Desktop, and publishing them to Tableau Server.
  • Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
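
A minimal sketch of the Hive workflow referenced above, assuming a Spark session with Hive support; the database, table, and column names are hypothetical. The same HiveQL submitted through Hive itself compiles to MapReduce (or Tez) jobs, whereas Spark executes it on its own engine.

    from pyspark.sql import SparkSession

    # Spark session with Hive support, so HiveQL and the Hive metastore are available.
    spark = (SparkSession.builder
             .appName("hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Create a partitioned Hive table stored as ORC (names are illustrative only).
    spark.sql("CREATE DATABASE IF NOT EXISTS sales")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.daily_orders (
            order_id STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)

    # Query the table; results come back as a Spark DataFrame.
    spark.sql("""
        SELECT order_date, SUM(amount) AS total_amount
        FROM sales.daily_orders
        GROUP BY order_date
    """).show()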

TECHNICAL SKILLS

Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark

Big Data/Hadoop Ecosystem: Hadoop, MapReduce, Kafka, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase, Zookeeper

Cluster Management & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle 10g/11g/12c, MS Access, Snowflake

NoSQL Databases: MongoDB, Cassandra, HBase

Workflow Management Tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend

Cloud Technologies: Microsoft Azure, AWS (Amazon Web Services)

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

Operating Systems: Unix, Linux, Windows

PROFESSIONAL EXPERIENCE

Confidential, Tampa, FL

Big Data Engineer

Responsibilities:

  • Developed NiFi workflows to pick up data from a REST API server, the data lake, and an SFTP server and send it to a Kafka broker.
  • Worked on Apache Spark, utilizing the Spark SQL and Streaming components to support intraday and real-time data processing.
  • Implemented a batch process for high-volume data loading using the Apache NiFi dataflow framework within an Agile development methodology.
  • Converted Talend joblets to support Snowflake functionality.
  • Created DAX queries to generate computed columns in Power BI.
  • Published reports and dashboards using Power BI.
  • Ingested data into Azure cloud services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and, as part of cloud migration, processed the data in Azure Databricks.
  • Built the logical and physical data models for Snowflake as per the required changes.
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Wrote UDFs in PySpark on Hadoop to perform transformations and loads.
  • Used NiFi to load data into HDFS as ORC files.
  • Extensively used Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, and Azure Data Factory, and created a POC for moving data from flat files and SQL Server using U-SQL jobs.
  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
  • Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premise and cloud platforms.
  • Planned and designed the data warehouse in a star schema; designed the table structures and documented them.
  • Created Snowpipe for continuous data loading from staged data residing on cloud gateway servers.
  • Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
  • Integrated Flume with Kafka and worked on monitoring and troubleshooting the Kafka-Flume-HDFS data pipeline for real-time data ingestion into HDFS.
  • Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
  • Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
  • Designed and implemented an end-to-end big data platform on a Teradata appliance.
  • Created tables and views in Snowflake as per business needs.
  • Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
  • Compiled data from various sources to perform complex analysis for actionable results.
  • Built performant, scalable ETL processes to load, cleanse, and validate data.
  • Developed ETL pipelines into and out of the data warehouse using Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Created SQL tables with referential integrity constraints and developed queries using SQL, SQL*Plus, and PL/SQL.
  • Implemented a near-real-time data pipeline using a framework based on Kafka and Spark (a minimal sketch follows this section).
  • Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 to the Hadoop cluster.
  • Performed source analysis, tracing data back to its sources and finding its roots through Teradata, DB2, etc.
  • Implemented a continuous integration and continuous delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
  • Fixed Kafka- and Zookeeper-related production issues across multiple clusters.

Environment: Apache Spark, Hadoop, Power BI, PowerShell, Talend, PySpark, HDFS, Cloudera, Kafka, Snowflake, Ant, Maven, Azure, Databricks, Data Lake, Data Factory, Python, NiFi, JSON, Teradata, DB2, PL/SQL, SQL Server, MongoDB, Shell Scripting, Zookeeper
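
A minimal sketch of the near-real-time Kafka-to-Spark pipeline pattern noted above, assuming Spark Structured Streaming with the Kafka source and ORC output on HDFS; the broker address, topic name, schema, and paths are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Hypothetical schema of the JSON events arriving on the Kafka topic.
    schema = (StructType()
              .add("event_id", StringType())
              .add("amount", DoubleType()))

    # Read a stream from Kafka (broker and topic are placeholders).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load())

    # Parse the JSON payload and keep only the event columns.
    parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))

    # Continuously write the parsed events to HDFS as ORC files.
    (parsed.writeStream
           .format("orc")
           .option("path", "hdfs:///data/events")
           .option("checkpointLocation", "hdfs:///checkpoints/events")
           .start()
           .awaitTermination())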

Confidential, El Segundo, CA

Big Data Engineer

Responsibilities:

  • Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
  • Used AWS components such as EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, security groups, and IAM.
  • Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools such as Hive, Pig, Sqoop, and Kafka, along with Python, Spark, Scala, NoSQL, NiFi, and Druid.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get each job done.
  • Created entity relationship diagrams (ERDs), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
  • Measured efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries, giving a more reliable and faster reporting interface with sub-second response for basic queries.
  • Designed and implemented large-scale pub-sub message queues using Apache Kafka.
  • Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Set up and optimized the ELK (Elasticsearch, Logstash, Kibana) stack and integrated Apache Kafka for data ingestion.
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Created reports in Looker based on Snowflake connections.
  • Responsible for ETL and data validation using SQL Server Integration Services.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
  • Implemented exception handling in Python to add logging to the application.
  • Analyzed the system for new enhancements/functionality and performed impact analysis of the application for implementing ETL changes.
  • Created a Kafka producer to send live-stream data into various Kafka topics (a minimal sketch follows this section).
  • Implemented data intelligence solutions around Snowflake Data Warehouse.
  • Designed and implemented big data ingestion pipelines to ingest multi-terabyte data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, storing the output in efficient storage formats; performed data wrangling on multi-terabyte datasets for a variety of downstream purposes such as analytics using PySpark.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Implemented and managed ETL solutions and automated operational processes.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Developed the batch program in PL/SQL for OLTP processing and used Unix shell scripts to run it via crontab.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
  • Optimized the TensorFlow model for efficiency.

Environment: Hadoop, Cloudera, HDFS, MapReduce, Kafka, Hive, Python, Redshift, Snowflake, SnowSQL, Informatica, AWS, EC2, S3, PL/SQL, Oracle 12c, Erwin, RDS, NoSQL, MySQL, DynamoDB, PostgreSQL, Tableau, GitHub
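
A minimal sketch of the kind of Kafka producer mentioned above, assuming the kafka-python client; the broker address, topic name, and event fields are placeholders.

    import json

    from kafka import KafkaProducer  # kafka-python client (assumed)

    # Producer that serializes dictionaries to JSON bytes (broker is a placeholder).
    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Send a sample live-stream event to a hypothetical topic.
    event = {"event_id": "abc123", "amount": 42.5}
    producer.send("live-events", value=event)

    # Flush so buffered messages are delivered before the script exits.
    producer.flush()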

Confidential

Hadoop Developer/ Data Engineer

Responsibilities:

  • Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
  • Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
  • Developed Airflow DAGs in Python by importing the Airflow libraries (a minimal sketch follows this section).
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Defined job workflows as per their dependencies in Oozie.
  • Designed, implemented, and deployed, within a customer's existing Hadoop/Cassandra cluster, a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models.
  • Architected several DAGs (directed acyclic graphs) for automating ETL pipelines.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Estimated the software and hardware requirements for the NameNode and DataNodes and planned the cluster.
  • Generated Kafka consumer group lag metrics using its API.
  • Deployed an Apache Solr/Lucene search engine server to help speed up the search of financial documents.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
  • Created Hive tables and was involved in data loading and writing Hive UDFs.
  • Played a key role in productionizing the application after testing by BI analysts.
  • Delivered a POC of Flume to handle real-time log processing for attribution reports.
  • Maintained system integrity of all sub-components related to Hadoop.
  • Wrote queries using DataStax Cassandra CQL to create, alter, insert, and delete elements.
  • Wrote MapReduce programs and Hive UDFs in Java.

Environment: Apache Hadoop, HDFS, Spark, Kafka, Solr, Hive, Python, DataStax Cassandra, MapReduce, Pig, Java, Flume, Cloudera CDH4, Oozie, Oracle 11g, MySQL, AWS.
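
A minimal sketch of an Airflow DAG of the kind referenced above, assuming Airflow 2.x with the BashOperator; the DAG ID, schedule, and commands are hypothetical.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Default arguments applied to every task in the DAG.
    default_args = {
        "owner": "data-engineering",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    # Hypothetical daily ETL DAG: extract, then load, once per day.
    with DAG(
        dag_id="daily_etl_sketch",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command="echo 'extract step (placeholder)'",
        )
        load = BashOperator(
            task_id="load",
            bash_command="echo 'load step (placeholder)'",
        )

        # Load runs only after extract succeeds.
        extract >> load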

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Created Hive, Phoenix, and HBase tables, and HBase-integrated Hive tables, as per the design using the ORC file format and Snappy compression.
  • Performed data validation and transformation using Python and Hadoop Streaming (a minimal sketch follows this section).
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
  • Involved in a story-driven Agile development methodology and actively participated in daily scrum meetings.
  • Used INSERT OVERWRITE to refresh the Hive data with HBase data daily, getting fresh data every day, and used Sqoop to load data from DB2 into the HBase environment.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience using spark-shell and Spark Streaming.
  • Created Sqoop jobs to import data from SQL, Oracle, and Teradata sources to HDFS.
  • Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and Parquet file format.
  • Extensively used the DB2 database to support SQL processing.
  • Automated workflows using shell scripts and Control-M jobs to pull data from various databases into Hadoop Data Lake.

Environment: Hadoop, HDFS, MapReduce, Spark, Hive, Pig, HBase, DB2, Java, Python, Oracle 10g, SQL, Splunk, Unix, Shell Scripting.
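
A minimal sketch of a Hadoop Streaming mapper of the kind referenced above, assuming tab-delimited input records; the field layout and validation rule are hypothetical. It would be submitted through the hadoop-streaming jar with -mapper, alongside -input and -output paths.

    #!/usr/bin/env python
    """Hadoop Streaming mapper: validates tab-delimited records from stdin.

    Records with a non-numeric amount field are dropped; valid records are
    re-emitted with the amount normalized to two decimal places.
    """
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Hypothetical layout: record_id, amount, ...
        if len(fields) < 2:
            continue  # skip malformed records
        record_id, amount = fields[0], fields[1]
        try:
            value = float(amount)
        except ValueError:
            continue  # skip records that fail validation
        print("%s\t%.2f" % (record_id, value))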
